N-Grams and Morphological Normalization in Text Classification: A Comparison on a Croatian-English Parallel Corpus

نویسندگان

  • Artur Silic
  • Jean-Hugues Chauchat
  • Bojana Dalbelo Basic
  • Annie Morin
چکیده

In this paper we compare n-grams and morphological normalization, two inherently different text-preprocessing methods, used for text classification on a Croatian-English parallel corpus. Our approach to comparing different text preprocessing techniques is based on measuring computational performance (execution time and memory consumption), as well as classification performance. We show that although n-grams achieve classifier performance comparable to traditional word-based feature extraction and can act as a substitute for morphological normalization, they are computationally much more demanding.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Comparing k-means clusters on parallel Persian-English corpus

This paper compares clusters of aligned Persian and English texts obtained from k-means method. Text clustering has many applications in various fields of natural language processing. So far, much English documents clustering research has been accomplished. Now this question arises, are the results of them extendable to other languages? Since the goal of document clustering is grouping of docum...

متن کامل

Building the Croatian-English Parallel Corpus

The contribution gives a survey of procedures and formats used in building the Croatian-English parallel corpus which is being collected in the Institute of Linguistics at the Philosophical Faculty, University of Zagreb. The primary text source is newspaper Croatia Weekly which has been published from the beginning of 1998 by HIKZ (Croatian Institute for Information and Culture). After quick su...

متن کامل

Initial Comparison of Linguistic Networks Measures for Parallel Texts

In this paper we compared the properties of linguistic networks for Croatian, English and Italian languages. We constructed co-occurrence networks from parallel text corpora, consisting of the translations of five books in the three languages. We generated an Erdös-Rényi random graph with the same number of nodes and links, which enabled the comparison with linguistic co-occurrence networks, sh...

متن کامل

Dealing with Data Sparseness in SMT with Factured Models and Morphological Expansion: a Case Study on Croatian

This paper describes our experience using available linguistic resources for Croatian in order to address data sparseness when building an English-to-Croatian general domain phrasebased statistical machine translation system. We report the results obtained with factored translation models and morphological expansion, highlight the impact of the algorithm used for tagging the corpora, and show t...

متن کامل

Evaluation of Croatian Word Embeddings

Croatian is poorly resourced and highly inflected language from Slavic language family. Nowadays, research is focusing mostly on English. We created a new word analogy corpus based on the original English Word2vec word analogy corpus and added some of the specific linguistic aspects from Croatian language. Next, we created Croatian WordSim353 and RG65 corpora for a basic evaluation of word simi...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007